Programming Massively Parallel Processors: A Hands-on Approach. The CUDA Execution Model: Host vs. Device

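The CUDA execution model turns your computer into a high-performance heterogeneous system. Picture a grand conductor (the host, i.e. the CPU) and an army of thousands (the device, i.e. the GPU): the conductor handles complex logic and decision-making, while the army carries out huge, repetitive tasks simultaneously.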

1. Structural Differences

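The host is a latency-optimized CPU, well suited to complex control flow and sequential tasks. The device, by contrast, is a throughput-optimized GPU that packs many simple cores and is designed to execute the same instruction across very large data sets simultaneously.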
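
One way to see the device side of this contrast on your own machine is to ask the CUDA runtime what the GPU reports about itself. The sketch below is illustrative only: the helper name `printDeviceSummary` is hypothetical, and it assumes device 0 is the GPU you care about. It relies on the runtime call `cudaGetDeviceProperties()` and a few fields of `cudaDeviceProp`.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper (not part of the lesson): prints a few properties of
// one CUDA device. Returns false if the device cannot be queried, for
// example when no CUDA-capable GPU is installed.
bool printDeviceSummary(int deviceId = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, deviceId) != cudaSuccess) {
        return false;
    }
    std::printf("Device %d: %s\n", deviceId, prop.name);
    std::printf("  multiprocessors:       %d\n", prop.multiProcessorCount);
    std::printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("  warp size:             %d\n", prop.warpSize);
    return true;
}
```

The values printed depend entirely on the installed hardware. The `max threads per block` figure is the per-block limit that a launch configuration has to respect.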

2. The Rhythm of Execution

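A CUDA program runs as a series of phases. Execution starts on the host for the "sequential code". When the program reaches a "parallel kernel", it launches a grid of threads onto the device. Once the device finishes that bulk workload, execution continues on the host with the next sequential phase.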
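
The phases described above can be sketched as a single host function. This is a minimal illustration, not the lesson's own code: the names `scaleKernel` and `scaleOnDevice` and the block size of 256 are assumptions made for the example. Note that a kernel launch returns to the host immediately; here the blocking `cudaMemcpy()` back to the host is what waits for the device to finish.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread multiplies one element by `factor`.
__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Hypothetical host function that walks through the phases.
// Returns false if any CUDA call reports an error.
bool scaleOnDevice(std::vector<float> &values, float factor) {
    int n = static_cast<int>(values.size());
    if (n == 0) return true;
    size_t bytes = n * sizeof(float);
    float *d_data = nullptr;

    // Phase 1 (host, sequential): allocate device memory and copy input over.
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(d_data, values.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        // Phase 2 (device, parallel): launch a grid of threads.
        int threadsPerBlock = 256;  // assumed block size for this sketch
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scaleKernel<<<blocks, threadsPerBlock>>>(d_data, factor, n);
        ok = cudaGetLastError() == cudaSuccess;
    }

    if (ok) {
        // Phase 3 (host, sequential): this copy waits for the kernel to
        // finish before bringing the results back.
        ok = cudaMemcpy(values.data(), d_data, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_data);
    return ok;
}
```

Calling `scaleOnDevice(v, 2.0f)` on a vector holding 1, 2, 3 should leave it holding 2, 4, 6, provided a CUDA device is available.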

3. Performance Specialization

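This model exploits the strengths of both: the CPU manages system resources and complex branching, while the GPU processes data elements in parallel using SPMD (Single Program, Multiple Data) logic.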
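
To make the SPMD idea concrete, the following sketch has every thread run the same kernel body while using its own index to pick a different data element. The names `clampNegativesKernel` and `clampNegatives` and the block size of 128 are hypothetical choices for illustration, and error checking is left out to keep the example short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Single program: every thread executes this same function.
// Multiple data: the computed index i selects a different element per thread.
__global__ void clampNegativesKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // surplus threads in the last block do nothing
    // Threads may take different branches depending on their own data.
    out[i] = (in[i] < 0.0f) ? 0.0f : in[i];
}

// Hypothetical host wrapper. CUDA error checking is omitted for brevity,
// so this is a sketch rather than production code.
std::vector<float> clampNegatives(const std::vector<float> &input) {
    int n = static_cast<int>(input.size());
    std::vector<float> output(n, 0.0f);
    if (n == 0) return output;

    size_t bytes = n * sizeof(float);
    float *d_in = nullptr;
    float *d_out = nullptr;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, input.data(), bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 128;  // assumed block size for this sketch
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    clampNegativesKernel<<<blocks, threadsPerBlock>>>(d_in, d_out, n);

    // Blocking copy: waits for the kernel, then fetches the results.
    cudaMemcpy(output.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return output;
}
```

The per-thread branch is the point of the example: under SPMD all threads share one program, but each can follow its own path through it based on the element it owns.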

QUESTION 1

Which architecture is characterized as being 'throughput-optimized'?

The Host (Intel® CPU)

The Device (NVIDIA® GPU)

The System RAM

The PCIe Bus

QUESTION 2

Part 1 of the MatrixMultiplication() example in Figure 3.6 must be completed with declarations of the Nd and Pd pointer variables and their corresponding cudaMalloc() calls, and Part 3 must be completed with the calls that release that device memory. Which snippet does this correctly?

float *Nd, *Pd; cudaMalloc((void**)&Nd, size); ... cudaFree(Nd);

float Nd, Pd; malloc(&Nd, size); ... free(Nd);

float *Nd, *Pd; cudaMemcpy(Nd, Pd, size); ... delete Nd;

int Nd, Pd; Nd = new float[size]; ... free(Nd);

QUESTION 3

In the CUDA execution model, where does a program always begin its execution?

On the Device (GPU)

Simultaneously on both

On the Host (CPU)

In the Global Memory

QUESTION 4

What happens when the Host encounters a phase with rich data parallelism?

It speeds up its clock frequency.

It launches a Kernel onto the Device.

It stores the data in the Host Cache.

It converts the code to Python.

QUESTION 5

A student attempts to launch a 1024x1024 matrix multiplication on G80 hardware using 1024 blocks, where each thread calculates one element. Why will this fail?

The G80 cannot handle 1024 blocks.

The total number of threads exceeds 1 million.

The configuration results in 1024 threads per block, exceeding the 512 hardware limit.

Matrix multiplication is not data parallel.